Overview of R for ISOQOL workshop for visualizing data.
R works like a calculator and follows the standard order of operations. For example, the third line of code in the chunk below does not give the same result as the fourth line, because the parentheses are evaluated first.
1+1
## [1] 2
74*80
## [1] 5920
4*(3+1)
## [1] 16
4*3+1
## [1] 13
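Beyond addition and multiplication, a few other arithmetic operators come up often. The short sketch below uses arbitrary numbers to show division, exponents, and how parentheses change the result:

```r
10 / 4        # division
## [1] 2.5
2^3           # exponent: 2 to the power of 3
## [1] 8
(2 + 3)^2     # parentheses are evaluated first
## [1] 25
2 + 3^2       # without parentheses the exponent is evaluated first
## [1] 11
```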
Exercise 1:
Create a new chunk of R code to solve the following:
1. Solve (75-54)*2
2. What is 385 divided by 3?
3. What is 45 squared?
To reuse values in R without retyping them, we store them in named objects. We assign a value to a name with the “<-” symbol or the “=” sign.
a <- 2
b <- 1
a+b
## [1] 3
a*b
## [1] 2
R is case sensitive. Let’s do an exercise together: create a new element “A” (upper case) and assign it the number 3. Notice that your environment now has two separate objects, “a” and “A”.
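A minimal sketch of what case sensitivity means in practice, reusing the value of “a” from above:

```r
a <- 2   # lower case
A <- 3   # upper case: a separate object, "a" is left unchanged
a + A
## [1] 5
```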
The elements we used above are numbers, but not everything we work with is numeric, so elements can also be strings, dates, or T/F (boolean). Strings are read as text and wrapped in quotes, either “” or ’’. Please notice that R will not let you do arithmetic on strings (for example, you cannot add a and d).
c <- T
d <- "Hello programmers"
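To check what type an element is, the class function can be used. The sketch below also shows what happens when a number and a string are added. The tryCatch call is only there so the error message is captured instead of stopping the chunk, and the message shown is the one from an English-language R session.

```r
a <- 2
c <- T
d <- "Hello programmers"

class(a)
## [1] "numeric"
class(c)
## [1] "logical"
class(d)
## [1] "character"

# Adding a number and a string is an error; tryCatch() captures the
# message so the rest of the chunk keeps running.
msg <- tryCatch(a + d, error = function(e) conditionMessage(e))
msg
## [1] "non-numeric argument to binary operator"
```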
Vectors are a collection of elements. To create a vector we use the letter c, wrap all elements in parentheses, and separate each item with a comma. You can supply elements of different types, but a vector holds only one type, so R converts them all to a common type. I like to think the c stands for collection.
v <- c(1,5,10)
z <- c(0,T,"Hey")
Vector “v” has three elements, all numbers. Vector “z” was built from three different types, but R stores all three as strings (“0”, “TRUE”, “Hey”).
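The type conversion can be checked with the class function. As a side note, the list function (not otherwise used in this workshop) is the usual way to keep elements of different types together; the object name “l” below is just for illustration.

```r
v <- c(1, 5, 10)
z <- c(0, T, "Hey")

class(v)
## [1] "numeric"
class(z)   # mixing types turns every element into a string
## [1] "character"
z
## [1] "0"    "TRUE" "Hey"

# A list keeps each element's own type
l <- list(0, TRUE, "Hey")
sapply(l, class)
## [1] "numeric"   "logical"   "character"
```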
Functions are actions we perform on elements (or a collection of elements) through arguments. Arguments are what you give the function to do its job. Please note that all functions use parentheses to collect the arguments.
# The "sum()" function takes a vector of numbers and adds them together.
sum(v)
## [1] 16
# Here's another function "log()".
log(v)
## [1] 0.000000 1.609438 2.302585
# This is an example of a function that can also just take a single number
log(a)
## [1] 0.6931472
# You can also perform multiple functions at once by "nesting" them
sum(log(v))
## [1] 3.912023
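Many functions take more than one argument. Arguments can be given by name or by position, and arguments you leave out fall back to their defaults. A small sketch using round and log, where the value of “x” is arbitrary:

```r
x <- 3.14159

round(x)               # digits defaults to 0
## [1] 3
round(x, digits = 2)   # argument given by name
## [1] 3.14
round(x, 2)            # same argument given by position
## [1] 3.14
log(100, base = 10)    # log() uses base e unless told otherwise
## [1] 2
```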
There are lots of options within RStudio and RStudio Cloud to find help. RStudio Cloud has a bar on the left side with learning options, including cheat sheets and guides. The “Help” tab in the lower right square allows you to search for functions and read the documentation on how to use them. Let’s search for the R function “sum”. Notice there is an argument called “na.rm”, which removes NA (missing) values from your vector before adding. Let’s use this below.
v <- c(1,5,10,NA)
sum(v,na.rm=T)
## [1] 16
sum(v)
## [1] NA
Notice that if you leave out the second argument you do not get a number; you get NA, because a sum that includes a missing value is itself unknown.
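The same pattern applies to many other functions. The sketch below uses is.na to find and count missing values, and max as one more example of a function with an “na.rm” argument:

```r
v <- c(1, 5, 10, NA)

is.na(v)               # TRUE marks the missing value
## [1] FALSE FALSE FALSE  TRUE
sum(is.na(v))          # how many values are missing
## [1] 1
max(v)                 # NA, for the same reason as sum(v)
## [1] NA
max(v, na.rm = TRUE)   # largest of the non-missing values
## [1] 10
```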
Exercise 2: Find a function we haven’t used yet, using any of the help resources described above, and apply it to the vector “v”.
There are special functions that allow you to read in data. Packages are collections of functions. To use a function within a package you first need to install the package with the code “install.packages(‘packagename’)”. Notice we did this in the prep work. Installing a package only needs to be done once. To load that package into your session, which you need to do each time you start R, use the library function:
library(readxl)
You can see your installed packages in the “Packages” tab of the lower right square of RStudio. Packages will not show up in this list until you install them. You can also view all of the functions within a package by clicking on the package.
Let’s use the read_xlsx function to read in some example data that we will use for the rest of the presentation and to create a dashboard. There are many different functions for reading in data, each tailored to the format of your data, for example read.csv for CSV files or read_xlsx for Excel files.
# The first argument to the read_xlsx function is the path to where the data lives.
data <- read_xlsx("/projects/bsi/az/projects/radonc/vizlab/s305577_isoqol2022_workshop/data/isoqol_data.xlsx")
# To find the file path to where your data lives you can also use the Files tab in the lower right corner of RStudio
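Other reading functions follow the same pattern of taking a file path as the first argument. As a self-contained sketch, the chunk below writes a small, made-up data frame (“toy” is hypothetical, not the workshop data) to a temporary CSV file and reads it back with read.csv:

```r
# Hypothetical toy data, written to a temporary file just for this example
toy <- data.frame(case = c(1, 2, 3), age = c(71, 53, 60))
path <- tempfile(fileext = ".csv")
write.csv(toy, path, row.names = FALSE)

# read.csv() takes the file path as its first argument, like read_xlsx()
toy_in <- read.csv(path)
toy_in
##   case age
## 1    1  71
## 2    2  53
## 3    3  60
```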
We can see our data is now in our environment and we can view the data by clicking on the dataset in our environment or using the code below.
#View(data)
#To print a single column within our data set we use the $ sign
data$age
## [1] 71 53 53 70 57 46 49 80 60 65 57 49 66 55 47 72 68 56 55 36 72 66 79 62 59
## [26] 45 62 61 48 63 73 63 53 54 49 66 61 62 29 55 66 51 39 62 46 58 81 57 41 72
## [51] 63 65 68 64 71 67 78 70 64 58 73 78 64 76 77 67
# We can use the summary function
summary(data)
## case age arm sex
## Min. : 78841 Min. :29.00 Length:66 Length:66
## 1st Qu.: 91510 1st Qu.:54.25 Class :character Class :character
## Median : 93364 Median :62.00 Mode :character Mode :character
## Mean : 97231 Mean :60.95
## 3rd Qu.:105752 3rd Qu.:68.00
## Max. :112386 Max. :81.00
##
## race wt ht
## Length:66 Min. : 46.40 Min. :150.0
## Class :character 1st Qu.: 66.72 1st Qu.:164.0
## Mode :character Median : 77.70 Median :171.5
## Mean : 80.37 Mean :172.4
## 3rd Qu.: 91.47 3rd Qu.:180.0
## Max. :132.00 Max. :199.0
##
## enroll_dt site tx_start_dt
## Min. :2022-01-04 00:00:00 Length:66 Min. :2022-01-06 00:00:00
## 1st Qu.:2022-01-30 00:00:00 Class :character 1st Qu.:2022-02-02 00:00:00
## Median :2022-03-10 00:00:00 Mode :character Median :2022-03-13 00:00:00
## Mean :2022-03-09 10:54:32 Mean :2022-03-11 15:16:21
## 3rd Qu.:2022-04-11 12:00:00 3rd Qu.:2022-04-13 12:00:00
## Max. :2022-05-14 00:00:00 Max. :2022-05-16 00:00:00
##
## tx_end_dt mo3_dt mo3
## Min. :2022-01-13 00:00:00 Min. :2022-04-15 00:00:00 Length:66
## 1st Qu.:2022-02-09 00:00:00 1st Qu.:2022-05-12 00:00:00 Class :character
## Median :2022-03-19 00:00:00 Median :2022-06-19 00:00:00 Mode :character
## Mean :2022-03-17 09:54:17 Mean :2022-06-17 09:54:17
## 3rd Qu.:2022-04-18 00:00:00 3rd Qu.:2022-07-19 00:00:00
## Max. :2022-05-19 00:00:00 Max. :2022-08-19 00:00:00
## NA's :3 NA's :3
## mo3_due_days mo6_dt mo6
## Min. :-174.000 Min. :2022-07-16 00:00:00 Length:66
## 1st Qu.: 0.000 1st Qu.:2022-08-12 00:00:00 Class :character
## Median : 0.000 Median :2022-09-19 00:00:00 Mode :character
## Mean : -5.111 Mean :2022-09-17 09:54:17
## 3rd Qu.: 0.000 3rd Qu.:2022-10-19 00:00:00
## Max. : 0.000 Max. :2022-11-19 00:00:00
## NA's :3 NA's :3
## mo6_due_days pro1_bsl pro2_bsl pro3_bsl
## Min. :-83.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.: 0.000 1st Qu.:3.000 1st Qu.:1.000 1st Qu.:3.000
## Median : 0.000 Median :4.000 Median :3.000 Median :4.000
## Mean : 7.677 Mean :3.621 Mean :2.667 Mean :3.879
## 3rd Qu.: 3.000 3rd Qu.:5.000 3rd Qu.:3.000 3rd Qu.:5.000
## Max. :184.000 Max. :5.000 Max. :5.000 Max. :5.000
## NA's :1
## pro1_mo3 pro2_mo3 pro3_mo3 pro1_mo6
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:3.000 1st Qu.:2.000 1st Qu.:1.500
## Median :3.000 Median :3.000 Median :3.000 Median :2.000
## Mean :2.607 Mean :3.525 Mean :2.852 Mean :2.349
## 3rd Qu.:3.000 3rd Qu.:5.000 3rd Qu.:4.000 3rd Qu.:3.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## NA's :5 NA's :5 NA's :5 NA's :23
## pro2_mo6 pro3_mo6 bsl
## Min. :1.000 Min. :1.000 Length:66
## 1st Qu.:2.000 1st Qu.:1.000 Class :character
## Median :3.000 Median :2.000 Mode :character
## Mean :2.674 Mean :2.372
## 3rd Qu.:3.000 3rd Qu.:3.000
## Max. :5.000 Max. :5.000
## NA's :23 NA's :23
summary(data$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 29.00 54.25 62.00 60.95 68.00 81.00
# The names function gives us all of the variable names within our data set.
names(data)
## [1] "case" "age" "arm" "sex" "race"
## [6] "wt" "ht" "enroll_dt" "site" "tx_start_dt"
## [11] "tx_end_dt" "mo3_dt" "mo3" "mo3_due_days" "mo6_dt"
## [16] "mo6" "mo6_due_days" "pro1_bsl" "pro2_bsl" "pro3_bsl"
## [21] "pro1_mo3" "pro2_mo3" "pro3_mo3" "pro1_mo6" "pro2_mo6"
## [26] "pro3_mo6" "bsl"
Let’s try a few new functions
# The mean function takes the mean or average age within the data set
mean(data$age)
## [1] 60.95455
# The median function takes the median, or 50th percentile, of the weight column
median(data$wt)
## [1] 77.7
Exercise 3:
1. Try a function we have already used on the height column within our data set. Hint: our height variable is “ht”.
2. What function would we use if we wanted to know the counts of the arm variable? Find the counts.
To subset the data we are going to learn about some functions within the tidyverse. To call on the tidyverse we must use the code below.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
The tidyverse uses a very special function called the pipe %>% which allows you to easily string together multiple functions in a way that you can read the code like a sentence.
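To see what the pipe does, compare a nested call with the same steps written as a pipeline. This sketch loads only dplyr, one of the tidyverse packages that provides the pipe, and reuses the vector “v” from earlier:

```r
library(dplyr)  # one of the tidyverse packages; it provides %>%

v <- c(1, 5, 10)

# Nested call, read from the inside out
round(sum(log(v)), 1)
## [1] 3.9

# The same steps with the pipe, read left to right:
# take v, then log it, then sum it, then round to 1 decimal
v %>% log() %>% sum() %>% round(1)
## [1] 3.9
```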
Let’s subset rows with the filter function
arm1 <- data %>% filter(arm=="A: IFL") # This will filter to just the patients within the first arm.
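The filter function can also combine several conditions. The sketch below uses a small made-up data frame (“toy” and its values are hypothetical, not the workshop data) to show “and”, “or”, and matching against a set of values:

```r
library(dplyr)

# Hypothetical toy data, not the workshop data set
toy <- data.frame(
  id  = 1:5,
  arm = c("A", "B", "A", "B", "A"),
  age = c(45, 62, 71, 58, 66)
)

# Conditions separated by a comma must all be true (same as using &)
older_a <- toy %>% filter(arm == "A", age > 60)
older_a$id
## [1] 3 5

# Use | for "or"
extremes <- toy %>% filter(age < 50 | age > 70)
extremes$id
## [1] 1 3

# Use %in% to keep rows matching any of several values
arm_b <- toy %>% filter(arm %in% c("B"))
arm_b$id
## [1] 2 4
```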
Exercise 4:
1. Filter to the cases that are taller than 6 feet. Hint: our height column is in cm, and 6 feet is approximately 183 centimeters.
2. Who is the tallest patient in the dataset?
Let’s subset the data by columns now using the select function.
# Select just the survey data
ex1 <- data %>%
  select(
    case, pro1_bsl:pro3_mo6
  )
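Besides naming columns or using a range with “:”, select accepts helper functions that match on column names. A sketch with a made-up data frame (“toy” is hypothetical); names is used only to keep the printed output short:

```r
library(dplyr)

# Hypothetical toy data with column names shaped like the survey columns
toy <- data.frame(
  id = 1:3, age = c(45, 62, 71),
  pro1_bsl = c(3, 4, 5), pro2_bsl = c(1, 3, 2), pro1_mo3 = c(2, 3, 3)
)

# Columns whose names start with "pro1"
toy %>% select(id, starts_with("pro1")) %>% names()
## [1] "id"       "pro1_bsl" "pro1_mo3"

# Columns whose names end with "_bsl"
toy %>% select(ends_with("_bsl")) %>% names()
## [1] "pro1_bsl" "pro2_bsl"

# A minus sign drops a column and keeps the rest
toy %>% select(-age) %>% names()
## [1] "id"       "pro1_bsl" "pro2_bsl" "pro1_mo3"
```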
A common question asked by clinicians is how many unique patients we are including. This is important to ensure that patients are not duplicated. To check, we can use the “case” variable, which holds all of the patient ids, together with the “length” and “unique” functions.
length(unique(data$case))
## [1] 66
Notice that the number of unique patients matches the number of rows in our dataset.
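For comparison, here is a sketch of what the check looks like when a duplicate is present. The ids are made up for illustration, and duplicated is a base R function that flags repeats:

```r
# Hypothetical ids in which patient 102 appears twice
ids <- c(101, 102, 103, 102)

length(ids)            # number of rows
## [1] 4
length(unique(ids))    # number of unique patients
## [1] 3
duplicated(ids)        # TRUE marks a repeat of an earlier value
## [1] FALSE FALSE FALSE  TRUE
ids[duplicated(ids)]   # which ids are repeated
## [1] 102
```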
We create new variables with the mutate function. Inside mutate, the case_when function lets us group the values of a variable into categories.
Let’s create a new age variable that is grouped.
data <- data %>%
  mutate(
    # case_when is a function that allows you to group the variable
    age_group = case_when(
      age<=50 ~ "<=50",
      age>50 & age<=60 ~ "51-60",
      age>60 & age<=70 ~ "61-70",
      age>70 ~ "70+",
      TRUE ~ NA_character_
    )
  )
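It is worth checking that values on the boundaries land in the group you expect. The sketch below applies the same conditions to a made-up vector of ages (“ages” and “groups” are hypothetical names) that includes the cut points and a missing value, then counts each group with table:

```r
library(dplyr)

# Hypothetical ages chosen to sit on the group boundaries
ages <- c(45, 50, 51, 60, 61, 70, 71, NA)

groups <- case_when(
  ages <= 50             ~ "<=50",
  ages > 50 & ages <= 60 ~ "51-60",
  ages > 60 & ages <= 70 ~ "61-70",
  ages > 70              ~ "70+",
  TRUE                   ~ NA_character_  # anything unmatched, such as a missing age
)

# useNA = "ifany" adds a count for missing values so they are not silently dropped
table(groups, useNA = "ifany")
```

Each of the first three groups should contain two ages, and “70+” and NA should each contain one.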
Exercise 5:
1. Create a BMI variable in the data set. The weight is measured in kilograms and the height is measured in centimeters. Here is the formula: BMI = (weight*10000)/(height^2)
The datatable function within the DT package allows us to view and interact with our data easily.
library(DT)
datatable(data)
This will print the datatable in the Viewer tab in the lower right corner if run in the console. It will also allow you to filter/search the data.
The hist function will create a histogram of a numeric vector. If run in the console, the plot can be viewed within the Plots tab in the lower right-hand corner.
hist(data$ht)
#hist(data$bmi)
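The hist function also takes arguments that control the title, axis label, number of bins, and color. A sketch using simulated heights (“ht” and its values are made up for illustration, not taken from the workshop data):

```r
# Hypothetical heights in cm, simulated for illustration only
set.seed(1)
ht <- rnorm(66, mean = 172, sd = 10)

hist(
  ht,
  breaks = 10,                        # suggested number of bins
  main   = "Distribution of Height",  # plot title
  xlab   = "Height (cm)",             # x-axis label
  col    = "lightblue"                # bar fill color
)
```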
“boxplot” is a function that creates a box plot.
boxplot(data$age)
boxplot(age ~ arm, data = data)
Notice that in the second example we create a boxplot of age with respect to arm by giving the function a formula of the form y ~ x, where age is on the y axis and arm is on the x axis.
ggplot2 uses layers, and you can string functions together with a “+” sign. The first argument within ggplot is the data and the second is the aesthetics, or aes, where you define your x and y variables and the fill color.
p<-ggplot(data, aes(x=arm, y=age,fill=arm)) +
geom_boxplot()
p <- p + theme_bw() + ylab("Age at Enrollment (yrs)")
p
“theme_bw()” makes the background white, with light grey grid lines and a dark border around the plot, all with a single function. “ylab” changes the y-axis label.
Exercise 6: How would we change the label of the x-axis?
Let’s change the colors of our plot.
p<-ggplot(data, aes(x=arm, y=age,fill=arm)) +
geom_boxplot() + theme_bw() + ylab("Age at Enrollment") +
xlab("Arm") + scale_fill_manual(values=c("red", "blue","green"))
p
Above I used base R color names, which are wrapped in quotes. For the full list of possible base R colors, see the link: Base R Colors
You can also use any color you like by giving its HTML hex color code, from the link below. HTML Colors
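As a sketch of how hex codes are used, the chunk below builds a boxplot from made-up data (“toy”, “p_hex”, and the three hex codes are arbitrary choices for illustration). Naming the values inside scale_fill_manual ties each color to a specific arm rather than relying on their order:

```r
library(ggplot2)

# Hypothetical toy data with three arms, for illustration only
toy <- data.frame(
  arm = rep(c("A", "B", "C"), each = 4),
  age = c(45, 52, 60, 71, 50, 58, 63, 66, 39, 55, 62, 70)
)

p_hex <- ggplot(toy, aes(x = arm, y = age, fill = arm)) +
  geom_boxplot() +
  theme_bw() +
  # Hex codes are strings that start with # followed by six characters
  scale_fill_manual(values = c(A = "#1B9E77", B = "#D95F02", C = "#7570B3"))
p_hex
```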
Exercise 7: Change the colors of the plot above.
“plotly” is a package whose ggplotly function makes your ggplot graphs interactive and adds hover information.
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
ggplotly(p)
The packages below will help us make our example dashboard. Let’s load them into our environment to create our example.
library(shiny)
##
## Attaching package: 'shiny'
## The following objects are masked from 'package:DT':
##
## dataTableOutput, renderDataTable
library(knitr)
library(flexdashboard)